-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add automatic fingerprinting for Term records #138
Conversation
Why these changes are being introduced: We currently use fingerprinting for suggested resources. We are planning to expand this functionality to terms to support clustering, and we also need to refactor suggested resource fingerprinting to accommodate multiple phrases/fingerprints per suggested resource. This felt like a good opportunity to reevaluate our fingerprinting implementation. Relevant ticket(s): * [TCO-74](https://mitlibraries.atlassian.net/browse/TCO-74) * [Add automatic fingerprinting for Term records (unticketed -- link to PR)](#138) How this addresses that need: This ADR proposes that we add a central `Fingerprint` model, linked to `Terms`, which is leveraged by `SuggestedResource` and any other detectors that require fingerprinting. EngX had discussed this as a possible approach in a recent meeting. Consensus on this ADR would confirm the decision. Side effects of this change: See ADR for details.
41184d0
to
9002903
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking:
- I'd like to see some unit tests for the fingerprint algorithm.
Suggestions (non-blocking):
- rename TermFingerprint to Fingerprint
- move fingerprinting logic out of Term and into Fingerprint (I'm nearly blocking on this so please consider it strongly and if you don't think it is the correct path let's chat so I can understand why)
- use
assert_nil
test/models/term_fingerprint_test.rb
Outdated
|
||
assert_equal term_count, Term.count | ||
assert_equal fingerprint_count - 1, TermFingerprint.count | ||
assert_instance_of NilClass, target_term.term_fingerprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to consider assert_nil
here:
https://ruby-doc.org/3.0.5/gems/minitest/Minitest/Assertions.html#method-i-assert_nil
Both work, but I think the more explicit assertion is slightly more efficient to mentally process (for me at least)
app/models/term.rb
Outdated
# This is similar to the SuggestedResource fingerprint method, with the exception that it also replaces " with " | ||
# during its operation. This switch may also need to be added to the SuggestedResource method, at which point they can | ||
# be abstracted to a helper method. | ||
def calculate_fingerprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any tests that confirm this logic works as expected.
You can likely snag them from https://github.com/MITLibraries/tacos/blob/main/test/models/detector/suggested_resource_test.rb
as we'll be refactoring and removing those tests soon to use the Term fingerprints.
This feels too important to hide behind "we don't test private methods" to me. If you don't test it directly (which is fine for a private method), we still should test the effects of it by testing the generated fingerprints on a Term (although it's probably slightly easier to just test it directly).
I'm also curious if this might better belong in the Fingerprint model itself. It feels slightly odd to have a model dedicated to fingerprints but have the actual logic in Term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thankfully, ending up with an instance method for this logic means that we can't hide behind "we don't test private methods" anymore. Either way, I agree that tests are needed here - not sure why I didn't write them initially.
I've tried to make clear in the test structure that there's a test for every line in the method, to make sure we keep focus on its components rather than trying to manage one all-encompassing example. Hopefully future-us will continue to keep up if this logic changes.
app/models/term_fingerprint.rb
Outdated
# created_at :datetime not null | ||
# updated_at :datetime not null | ||
# | ||
class TermFingerprint < ApplicationRecord |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have decided to only have one set of stored fingerprints, it feels worth considering whether to name this Fingerprint
instead of TermFingerprint
. This is non-blocking, but if you choose to not make this change I'd like to hear why you prefer it this way.
As noted elsewhere, I'd prefer if we can move the logic to calculate a fingerprint to this model rather than keep it in Term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have thoughts about naming the class versus naming the field? Using the same name for both feels like it would get awkward, particularly from the perspective of the Term model. term.fingerprint
feels like the method to call to get the calculated string, but if the class is also Fingerprint then things get a bit weird.
For now I'm using .fingerprint_value
as the Term method, but I'm not sure that's going to be the case when it gets to code review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm particularly interested in your feedback about is the naming being used here. Having a Fingerprint
model, and a fingerprint
field, feels awkward, particularly from the Term perspective. I've tried to resolve this by adding the .fingerprint_value
construct that delegates down to the Fingerprint model, and allows the Term to just get access to the string value somehow.
That said, I don't think this gets used anywhere yet - so maybe it would makes sense to remove it until the UI or something leverages it.
All of those suggestions look good to me, thanks - I'll try to get these wrapped up either today or Monday. |
My question about moving TermFingerprint to Fingerprint (which makes sense on paper) is whether I should retcon the two migrations in this PR to just use that name, or instead create new migrations to drop this table and create the new one. I suspect that the answer is to retcon, and just re-create the review app for this and advise everyone to rollback / migrate carefully - but that feels like something to confirm with you and Adam. |
@matt-bernhardt I'd go with retcon. Make sure to rollback your local database before doing that or the retconned tables will never go away for you. And yeah, for the PR build just make a new one to avoid any confusion. |
I'm putting this into draft mode while I work on this refactoring. |
9002903
to
0930853
Compare
Okay @JPrevost - I think the fingerprinting PR is once again ready for a review. I think I've picked up the things you asked for changes on, and I think things are better as a result. Let me know what you think? |
add_reference :terms, :fingerprint, foreign_key: true | ||
|
||
# Seed the relationship between Terms and Fingerprints | ||
Term.all.each do |t| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This took a very long time and then failed locally for me with a large set of Terms loaded.
I'd like to load the full set of Terms/SearchEvents into stage before we merge to confirm we won't have issues in prod. That is likely blocked by stage db size so I'll pull that ticket to unblock this work.
For now I'll send you the error I saw via Slack and I suggest you load the full set from prod (you have to load in batches) so you can try to replicate the problem I ran into. Loading in batches by month is how I've approached this as we can't export the full set at this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, I want to consider not doing this large data process in a migration (I know, I know) and instead making it a rake task we run after migration. If I understand this work, fingerprint is not required and thus the migration should be fine without it and we can have a task that regenerates all fingerprints to run manually after this work lands both to generate the full set of fingerprints initially but also in case our fingerprint algorithm ever changes this would allow us to regenerate them.
76da216
to
00b909c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: I'd like us to consider Fingerprint.value
instead of Fingerprint.fingerprint
. It's not great, but it feels slightly less confusing.
Also non-blocking, but consider whether the default state in fixtures of Terms having no Fingerprints feels correct to you.
If you don't intend to make either of those changes just let me know and I'll approve as the code works as needed. Thanks!
app/models/fingerprint.rb
Outdated
# Table name: fingerprints | ||
# | ||
# id :integer not null, primary key | ||
# fingerprint :string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we used value
here? It is not ideal, but it feels less awkward than Term.fingerprint.fingerprint
and doesn't hide behind the alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems worth trying, yeah - I'll take a run at it, and see if I can make it work.
** Why are these changes being introduced: After watching our search traffic for a few months, we have seen some terms come in that are clearly related: * 'Scientific American' * 'scientific american' * '"Scientific american"' While there are valid reasons to store the term exactly as the user has submitted it - there are also good reasons to standardize these values, to look for clusters and related terms. ** Relevant ticket(s): n/a ** How does this address that need: This defines a TermFingerprint model, which is related to the Term model via a belongs_to relationship (a Term belongs to its TermFingerprint, and a TermFingerprint can have many Terms). The migration to define this model includes a migration to populate records for Terms we've already received. The Term model gets two lifecycle hooks, one to add new fingerprints for every term, and another to delete a TermFingerprint if no remaining Terms reference it. Beyond the methods needed to handle these creations and deletions in as seemless a way as possible, we also add a .cluster method to the Term model, which will return an array of all related Terms (but not the Term itself - so a Term that has a unique fingerprint will return an empty array). This method will be used as part of a future inspection UI. Building the models in this way will also allow for querying based on shared fingerprints - for example by adding a :has_many_terms scope on the TermFingerprint model. ** Document any side effects to this change: Two things: 1. A quirk of this implementation is that it is possible to delete a TermFingerprint record, which does not delete the related Terms (and thus also SearchEvents, Detections, and Categorizations). The Term will then have a null fingerprint. The next time the Term is saved, its fingerprint will be regenerated. No other operation should be impacted by this arrangement. For the life of me I can't anticipate why we might delete a fingerprint - but the relationship needs to be optional to avoid further pain when working in the console. 2. The process for calculating this TermFingerprint value is similar - but not identical - to that for handling the SuggestedResource records. The difference is that the TermFingerprint method removes """ sequences - which might need to be added to SuggestedResource. Both methods should probably be abstracted out to a shared helper method, honestly.
I noticed this when running make lint - not sure how it slipped past my notice when working on the confirmation work.
Rename migration to get date sort correct Run migrations
** Why are these changes being introduced: It makes sense that the Fingerprint model is where the logic to calculate a fingerprint value is maintained, and not in the Term model. ** Relevant ticket(s): code review ** How does this address that need: This moves the fingerprint logic from the Term to Fingerprint model. Because of this, the Term model changes to use its new location (from inside the lifecycle hook where the Fingerprint record gets created). We use an instance method for this, because at the time of use there is not yet a Fingerprint record that could call the method internally. We also copy-paste the tests for this method from the SuggestedResource implementation (which should be removed in a future PR). ** Document any side effects to this change: I can think of two possible side effects: 1. the SuggestedResource implementation of the fingerprint is now very much duplicative, and should be removed (coming in a future ticket) 2. It is a bit awkward to rely on an instance method for calculating the fingerprint value. In an ideal case, register_fingerprint would use something like: self.fingerprint = Fingerprint.find_or_create_by(phrase) ... and then the Fingerprint model would: a. receive the phrase argument b. calculate the fingerprint for this phrase c. look up to see if such a record exists already, and return it d. create the record if none exists, and return _that_ A custom initialize method feels like it would work for this, but I think that's considered an antipattern in Rails? For now, I think this works.
This comments out the initial calculation of Fingerprint records from the migrations in this PR. A following commit will add a rake task that will focus on this work. A side effect of this work is that we need to change the fixtures used for our test suite, to match the state the migrations will produce. This in turn causes some tests to change, because some tests will require additional setup work. More complex tests now have code comments to indicate this sort of sequencing.
Includes a very rudimentary test, and finally removing the (commented out) block from the db migration.
Three tests need to be updated to account for the different starting position.
Until now, we've been building the Fingerprint model to include a field also named "fingerprint" for the actual string value of the fingerprint. This removes that duplication by changing the field name to be "value". As a result of updating the migration with this new terminology, we also update the alias / delegation up to the Term model, so terms can still find the string via the term.fingerprint_value attribute. The fixtures use the updated field name, and some references to the new field name (in application code and in tests) are updated.
eec2205
to
051e7d7
Compare
@JPrevost I've added two commits this afternoon, changing the |
|
||
validates :value, uniqueness: true | ||
|
||
alias_attribute :fingerprint_value, :value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non blocking. I'm not sure this alias will be useful. Term.fingerprint.value
vs Term.fingerprint_value
. If you like the ergonomics of that that definitely keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I think I like the ergonomics of having it directly on the Term
model. This may turn out to be something we remove later, which would be fine, but I'd rather deal with that down the road at this point.
new_record = { | ||
value: Fingerprint.calculate(phrase) | ||
} | ||
self.fingerprint = Fingerprint.find_or_create_by(new_record) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non blocking. You may intentionally be making the hash, but I wanted to share it is not necessary and you could instead just pass the key and value directly without the intermediary object creation.
self.fingerprint = Fingerprint.find_or_create_by(value: Fingerprint.calculate(phrase))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This two-step process was initially a reaction to a one-step process that used some variation on the word "fingerprint" about half a dozen times, which bordered on the nonsensical to read even 30 seconds later.
While I agree that this isn't necessary, I'm inclined to keep it in this format, at least for now. Future us can refactor if it continues to bug us.
This adds a TermFingerprint model, which allows for clustering of terms by a shared fingerprint. Full details are in the commit message.
Developer
Ticket(s)
This started as an experiment that kept seeming promising, so I didn't create a ticket for it. Now that I'm writing the PR text, I think maybe I missed an opportunity to do this - but am not going to paper over that decision now without checking in with the project team.
https://mitlibraries.atlassian.net/browse/TCO-124
Accessibility
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
Documentation
ENV
Stakeholders
Dependencies and migrations
NO dependencies are updated
YES migrations are included
Reviewer
Code
added technical debt.
Documentation
(not just this pull request message).
Testing